-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clean factorsSPGMI and stocksCRSP data sets #83
Conversation
@spinnj Thanks for your PR. It is nice to see folks contributing to the software who appear to be using both for work and teaching. Concerning your comments above.
I noticed that you changed some of the original source data names, which is highly inadvisable. This is something that Doug Martin insists on doing, and has ignored both Brian and I's experience and advice on the matter. Therefore, please revert the sector names to their original format asap. Thank you, |
OK. Checking to make sure I understand exactly what you want me to revert - is the concern the renaming of TickerLast column to TickerL or is it the modified Sectors (changing row contents for each security)? |
I'm tired of trying to impose best practices, but varying from vendor names is almost never a good idea, because then the analyst (or student) can't look up what things mean online, and they can't use other people's code that assumes the vendor data will be unmolested. If we want the parsed R objects to have different names, I guess that is arguably OK fropm a pedantic point of view, but even then, use factor labels, and provide a function to change the change the labels for the factor levels, don't change the source data. Changing the names of the input data sources is frankly dumb, because then you can never go back to the original data source and reconcile. You can also never merge new data from the input data source. We've already seen merge problems when Doug wanted to get updated data from S&P or further adjusted things. If people who are supposedly highly familiar with the data set can't merge it cleanly, then how are other people going to feel if they download data from S&P? In a production data pipeline, as I'm sure you and your team are aware, you usually ingest the source data as is, and then create processes for cleaning or labeling or fixing bad data points, outliers, etc. These things are supposed to be documented and repeatable. Your sandbox code is a start, but it isn't documented, and the documentation for the data sources in the package doesn't explain that the data has been altered, how it was altered, and where the code to apply or reverse the alterations may be found. Both Justin and I have voiced concerns several times about renaming things. They won't match S&P's naming conventions or documentation, you can't merge new data, the list is too long to mention. Not to mentio, we are far beyond the days of fixed-width terminals, how does shortening a name make it more readable or usable? I would suggest it is far more common in practice to clean things like badly placed spaces or puntuation in names, or to provide printed names that are "pretty" and usually longer (like full company names instead of data identifiers). Best practice is to create code to make changes, and apply that code in a systematic way. Even with students or a textbook, we should be teaching people how real data sources work, warts and all, not cleaning it all up, giving them artificially perfect toy examples, and then wondering why they fail when they get to industry and 80% of the work is understanding, parsing, and manipulating input data. Like I said, I'm sick of arguing about it where it is clear that there is no intention to adhere to industry best practices. Our time is probably better spent fixing bugs and making improvements in the functions themselves (there are many, see the issues list). |
I haven't been part of any arguments and so I think I've come in to the middle of a conversation here. Overall, understood, can we simply revert everything from my pull request and move forward in your preferred way - I'm not opposed to that in any way. |
Yes, @spinnj it is both the TickerLast to TickerL and modified Sectors labels. As Brian helpfully provided a longer explanation, original names are best for a variety of reasons. |
OK, understood. Best course of action may be to revert the whole thing, let me go back and keep things as they were and match the sector names to the GICS documentation, and I can submit another PR later? |
Done. Just to illustrate a related point to this conversations, I ran the R CMD check after the PR was merged and got a new error in the Examples files(See #43 on this issue).
This is due to "TickerLast" being changed to "TickerL". It will impact a bunch of other files too, since we recycle many of the examples, but the Examples error only prints the first one. The main goal is to get this on CRAN and then have it stay there for many years with minor maintenance. Thus we need to stamp out bugs and eliminate errors, warnings, and notes on build in the existing code base. We don't want to add others in the process if we can help it because we have a lot of other issues to stamp out (documented under the issues tab). In addition, please see the project board here for the strategy I outlined to tackle existing issues and get this on CRAN. I believe your PR will also solve issue #73, which is most welcome, so thanks again. |
This commit broke things, so reverting in #84 was the correct response for now. It is actually very close to mergable. I commented on the cleaning script file directly, but I will summarize my comments here to keep the discussion in the thread. lines 7/8 of the cleaning script use absolute paths. this will break for anyone who doesn't have the code at C:/FA?FactorAnalytics. please change to relative paths from project root. lines 102/103 rename column TickerLast to TickerL. This is a breaking change, as Justin noted above. My personal opinion is that making a breaking change to save three characters isn't necessary. Anyone familiar with the CRSP data set is used to the name TickerLast. If we want to make breaking changes anyway, then the requirement for mergability is to refactor to make it non-breaking. There are ~18 places in the lines 89 and 96 (reordering columns) will make future merges of vendor data break. there are already other breaking changes in the data (like changing the text names of Industry Sectors rather than changing labels for factor levels), so I don't consider this reordering to be a chnage that breaks future merges any more than other previous changes. it doesn't seem necessary, but there's no prevailing reason not to include it in the commit. Sorry for the noise and the (minor) additional work, and thanks for contributing. |
Much appreciated. I will go through it and clean things up as per your recommendations and additionally deal with issue #85.
Jon
…________________________________
From: Brian G. Peterson ***@***.***>
Sent: Sunday, March 27, 2022 9:52 AM
To: braverock/FactorAnalytics ***@***.***>
Cc: Spinney, Jon (Vestcor) ***@***.***>; Mention ***@***.***>
Subject: Re: [braverock/FactorAnalytics] clean factorsSPGMI and stocksCRSP data sets (PR #83)
WARNING: External Email!
This commit broke things, so reverting in #84<#84> was the correct response for now.
It is actually very close to mergable.
I commented on the cleaning script file directly, but I will summarize my comments here to keep the discussion in the thread.
lines 7/8 of the cleaning script use absolute paths. this will break for anyone who doesn't have the code at C:/FA?FactorAnalytics. please change to relative paths from project root.
lines 102/103 rename column TickerLast to TickerL. This is a breaking change, as Justin noted above. My personal opinion is that making a breaking change to save three characters isn't necessary. Anyone familiar with the CRSP data set is used to the name TickerLast.
If we want to make breaking changes anyway, then the requirement for mergability is to refactor to make it non-breaking. There are ~18 places in the R/ directory that reference TickerLast, and about 14 places in the man/ directory. Refactoring would change the .R files first, and regenerate the relevant roxygen files in man/. If this changed all references to TickerLast and R CMD check runs without breaking on TickerLast, then I think this would be mergable.
lines 89 and 96 (reordering columns) will make future merges of vendor data break. there are already other breaking changes in the data (like changing the text names of Industry Sectors rather than changing labels for factor levels), so I don't consider this reordering to be a chnage that breaks future merges any more than other previous changes. it doesn't seem necessary, but there's no prevailing reason not to include it in the commit.
Sorry for the noise and the (minor) additional work, and thanks for contributing.
—
Reply to this email directly, view it on GitHub<#83 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AEVGTQG2RVLEHKSVL3QNT7LVCBKZTANCNFSM5RVRHOUA>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
Please note our name change to Vestcor Inc. and our new email domain is @Vestcor.org
This e-mail message (including attachments) is confidential and may be privileged. Any unauthorized distribution or disclosure is prohibited. Disclosure to anyone other than the intended recipient does not constitute waiver of privilege. If you have received this e-mail in error, please notify us and delete it, including any attachments, from your computer system and records.
---------------------------------------------------------------------------------------------------------
S'il-vous-plaît noter que nous sommes maintenant Vestcor Inc. et le nouveau domaine de courriel électronique est @Vestcor.org
Ce courriel (y compris les pièces jointes) est confidentiel et peut être privilégié. La distribution ou la divulgation non autorisée de ce courriel est interdite. Sa divulgation à toute personne autre que son destinataire ne constitue pas une renonciation de privilège. Si vous avez reçu ce courriel par erreur, veuillez nous aviser et éliminer ce courriel, ainsi que les pièces jointes, de votre système informatique et de vos dossiers.
|
factorsSPGMI and stocksCRSP had the following issues:
Additionally, certain other changes were made at the request of Doug Martin:
Incorporating this pull request would:
We can be reached at [email protected] and [email protected] for comments/questions/concerns.